Abstract. We show that forms of Bayesian and MDL inference that
are often applied to classification problems can be inconsistent. This
means there exists a learning problem such that for all amounts of data
the generalization errors of the MDL classifier and the Bayes classifier
relative to the Bayesian posterior both remain bounded away from the
smallest achievable generalization error.

1 Introduction

Overfitting is a central concern of machine learning and statistics. Two frequently
used learning methods that in many cases 'automatically' protect against overfitting
are Bayesian inference [5] and the Minimum Description Length (MDL)
Principle [21, 2, 11]. We show that, when applied to classification problems, some
of the standard variations of these two methods can be inconsistent in the sense
that they asymptotically overfit: there exist scenarios where, no matter how much
data is available, the generalization error of a classifier based on MDL or the full
Bayesian posterior does not converge to the minimum achievable generalization
error within the set of classifiers under consideration.

Some Caveats and Warnings These results must be interpreted carefully. There
exist many different versions of MDL and Bayesian inference, only some of which
are covered. For the case of MDL, we show our result for a two-part form of MDL
that has often been used for classification, see Section 7. For the case of Bayes, our
result may appear to contradict some well-known Bayesian consistency results
[6]. Indeed, our result only applies to a 'pragmatic' use of Bayes, where the set
of hypotheses under consideration are classifiers: functions mapping each input
X to a discrete class label Y. To apply Bayes rule, these classifiers must be
converted into conditional probability distributions. We do this conversion in a
standard manner, crossing a prior on classifiers with a prior on error rates for
these classifiers. This may lead to (sometimes subtly) 'misspecified' probability
models not containing the 'true' distribution D. Thus, our result may be restated
as 'Bayesian methods for classification can be inconsistent under misspecification
for common classification probability models'. The result is still interesting, since
(1) even under misspecification, Bayesian inference is known to be consistent
under fairly broad conditions - we provide an explicit context in which it is
not; (2) in practice, Bayesian inference is used frequently for classification under
misspecification - see Section 6.

1.1 A Preview

Classification Problems A classification problem is defined on an input (or
feature) domain $\mathcal{X}$ and an output (or class label) domain $\mathcal{Y} = \{0, 1\}$. The problem
is defined by a probability distribution $D$ over $\mathcal{X} \times \mathcal{Y}$. A classifier is a function
$c : \mathcal{X} \to \mathcal{Y}$. The error rate of any classifier is quantified as:

$$e_D(c) = E_{(x,y) \sim D}\, I(c(x) \neq y)$$

where $(x, y) \sim D$ denotes a draw from the distribution $D$ and $I(\cdot)$ is the indicator
function which is 1 when its argument is true and 0 otherwise.

The goal is to find a classifier which, as often as possible according to $D$,
correctly predicts the class label given the input feature. Typically, the classification
problem is solved by searching for some classifier $c$ in a limited subset $\mathcal{C}$
of all classifiers using a sample $S = (x_1, y_1), \ldots, (x_m, y_m) \sim D^m$ generated by $m$
independent draws from the distribution $D$. Naturally, this search is guided by
the empirical error rate. This is the error rate on the sample $S$, defined by:

$$e_S(c) := E_{(x,y) \sim S}\, I(c(x) \neq y) = \frac{1}{m} \sum_{i=1}^{m} I(c(x_i) \neq y_i),$$

where $(x, y) \sim S$ denotes a sample drawn from the uniform distribution on $S$.
Note that $e_S(c)$ is a random variable dependent on a draw from $D^m$. In contrast,
$e_D(c)$ is a number (an expectation) relative to $D$.
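
The difference between the two error rates can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings: the toy distribution (a fair label and a single feature that copies the label except with probability `noise`) and all function names are hypothetical, not taken from the paper. For the classifier c(x) = x the true error rate equals `noise`, while the empirical error rate fluctuates around it.

```python
import random

def draw_sample(m, noise, rng):
    """Draw m i.i.d. pairs (x, y): y is a fair coin and x equals y,
    except with probability `noise`, when it is flipped (toy distribution)."""
    sample = []
    for _ in range(m):
        y = rng.randint(0, 1)
        x = 1 - y if rng.random() < noise else y
        sample.append((x, y))
    return sample

def empirical_error(classifier, sample):
    """e_S(c): the fraction of sample points on which c(x) != y."""
    return sum(classifier(x) != y for x, y in sample) / len(sample)

def identity(x):
    """The classifier c(x) = x; its true error rate here is `noise`."""
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    for m in (10, 1000, 100000):
        # The empirical error is random; it approaches 0.2 as m grows.
        print(m, empirical_error(identity, draw_sample(m, 0.2, rng)))
```

The empirical error depends on the particular sample drawn, which is why the consistency statements in this paper are phrased as probabilities over $S \sim D^m$.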



The Basic Result Our basic result is that certain classifier learning algorithms 
may not behave well as a function of the information they use, even when given 
infinitely many samples to learn from. The learning algorithms we analyze are
"Bayesian classification" (Bayes), "Maximum a Posteriori classification" (MAP),
and "Minimum Description Length classification" (MDL). These algorithms are
precisely defined later. Functionally they take as arguments a training sample $S$
and a "prior" $P$ which is a probability distribution over a set of classifiers $\mathcal{C}$. In
Section 3 we state our basic result, Theorem 2. The theorem has the following 
corollary, indicating suboptimal behavior of Bayes and MDL: 

Corollary 1. (Classification Inconsistency) There exists an input domain
$\mathcal{X}$, a prior $P$ always nonzero on a countable set of classifiers $\mathcal{C}$, a learning problem
$D$, and a constant $K > 0$ such that the Bayesian classifier $c_{\mathrm{BAYES}(P,S)}$, the
MAP classifier $c_{\mathrm{MAP}(P,S)}$, and the MDL classifier $c_{\mathrm{MDL}(P,S)}$ are asymptotically
$K$-suboptimal. That is, for each
$e \in \{ e_D(c_{\mathrm{BAYES}(P,S)}),\ e_D(c_{\mathrm{MAP}(P,S)}),\ e_D(c_{\mathrm{MDL}(P,S)}) \}$,
we have

$$\lim_{m \to \infty} \Pr_{S \sim D^m} \left( e > K + \inf_{c \in \mathcal{C}} e_D(c) \right) = 1.$$

How dramatic is this result? We may ask: (1) are the priors $P$ for which the
result holds natural? (2) how large can the constant $K$ become and how small
can $\inf_{c \in \mathcal{C}} e_D(c)$ be? (3) is it perhaps too strong to demand that an algorithm
which depends on the prior $P$ and the sample $S$ be consistent (asymptotically
optimal)? The short answer to (1) and (2) is: the priors $P$ have to satisfy several
requirements, but they correspond to priors often used in practice. $K$ can be
quite large and $\inf_{c \in \mathcal{C}} e_D(c)$ can be quite small - see Section 5.1 and Figure 1.

The answer to (3) is that there do exist simple algorithms which are consis- 
tent. An example is the algorithm which minimizes the Occam's Razor bound 
(ORB) [7], Section 4.2. 

Theorem 1. (ORB consistency) For all priors $P$ nonzero on a set of classifiers
$\mathcal{C}$, for all learning problems $D$, and all constants $K > 0$, the ORB classifier
$c_{\mathrm{ORB}(P,S)}$ is asymptotically $K$-optimal:

$$\lim_{m \to \infty} \Pr_{S \sim D^m} \left( e_D(c_{\mathrm{ORB}(P,S)}) > K + \inf_{c \in \mathcal{C}} e_D(c) \right) = 0.$$

The remainder of this paper first defines precisely what we mean by the above 
classifiers. It then states the main inconsistency theorem which implies the above 
corollary, as well as a theorem that provides an upper-bound on how badly Bayes 
can behave. In Section 4 we prove our theorems. Variations of the result are 
discussed in Section 5.1. A discussion of the result from a Bayesian point of view 
is given in Section 6, and from an MDL point of view in Section 7. 

2 Some Classification Algorithms 

The basic inconsistency result is about particular classifier learning algorithms 
which we define next. 

The Bayesian Classification algorithm The Bayesian approach to inference
starts with a prior probability distribution $P$ over a set of distributions $\mathcal{P}$ which
typically represents a measure of "belief" that some $p \in \mathcal{P}$ is the process generating
data. Bayes' rule states that, given sample data $S$, the posterior probability
$P(\cdot \mid S)$ that some $p$ is the process generating the data is:

$$P(p \mid S) = \frac{p(S)\, P(p)}{P(S)},$$

where $P(S) := E_{p \sim P}\, p(S)$. In classification problems with sample size $m = |S|$,
each $p \in \mathcal{P}$ is a distribution on $(\mathcal{X} \times \mathcal{Y})^m$ and the outcome
$S = (x_1, y_1), \ldots, (x_m, y_m)$ is the sequence of labeled examples.

If we intend to perform classification based on a set of classifiers $\mathcal{C}$ rather
than distributions $\mathcal{P}$, it is natural to introduce a "prior" $P(c)$ that a particular
classifier $c : \mathcal{X} \to \{0, 1\}$ is the best classifier for solving some learning problem.
This, of course, is not a Bayesian prior in the conventional sense because classifiers
do not induce a measure over the training data. Our inconsistency result
applies to the standard method of converting a "prior" over classifiers into a
Bayesian prior over distributions on the observations.

One common conversion [14, 22, 12] transforms the set of classifiers $\mathcal{C}$ into a
simple logistic regression model - the precise relationship to logistic regression
is discussed in Section 5.2. In our case $c(x) \in \{0, 1\}$ is binary valued, and then
(but only then) the conversion amounts to assuming that the error rate $\theta$ of the
optimal classifier is independent of the feature value $x$. This is known as
"homoskedasticity" in statistics and "label noise" in learning theory. More precisely,
it is assumed that, for the optimal classifier $c \in \mathcal{C}$, there exists some $\theta$ such that
$\forall x:\ P(c(x) \neq y) = \theta$. Given this assumption, we can construct a conditional
probability distribution $p_{c,\theta}$ over the labels given the unlabeled data:

$$p_{c,\theta}(y^m \mid x^m) = \theta^{\,m e_S(c)} (1 - \theta)^{\,m - m e_S(c)}. \qquad (1)$$

For each fixed $\theta < 0.5$, the log likelihood $\log p_{c,\theta}(y^m \mid x^m)$ is linearly decreasing
in the empirical error that $c$ makes on $S$. By differentiating with respect to $\theta$, we
see that for fixed $c$, the likelihood (1) is maximized by setting $\theta := e_S(c)$, giving

$$\log \frac{1}{p_{c, e_S(c)}(y^m \mid x^m)} = m H(e_S(c)), \qquad (2)$$

where $H$ is the binary entropy $H(\mu) = -\mu \log \mu - (1 - \mu) \log(1 - \mu)$, which is
strictly increasing for $\mu \in [0, 0.5)$. We further assume that some distribution
$p_x$ on $\mathcal{X}^m$ generates the $x$-values. We can apply Bayes rule to get a posterior on
$p_{c,\theta}$, denoted as $P(c, \theta \mid S)$, without knowing $p_x$, since the $p_x(x^m)$-factors cancel:

$$P(c, \theta \mid S) = \frac{p_{c,\theta}(y^m \mid x^m)\, p_x(x^m)\, P(c, \theta)}{P(y^m \mid x^m)\, p_x(x^m)} = \frac{p_{c,\theta}(y^m \mid x^m)\, P(c, \theta)}{E_{c,\theta \sim P}\, p_{c,\theta}(y^m \mid x^m)}. \qquad (3)$$

To make (3) applicable, we need to incorporate a prior measure on the joint
space $\mathcal{C} \times [0, 1]$ of classifiers and $\theta$-parameters. In the next section we discuss the
priors under which our theorems hold.
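
As a numerical sanity check of the maximum-likelihood step behind (2), the sketch below assumes the label-noise likelihood has the form $\theta^{k}(1-\theta)^{m-k}$, where $k = m\, e_S(c)$ is the number of mistakes of $c$ on $S$, and that logarithms are taken to base 2 (so code lengths are in bits). The function names and the grid search are illustrative choices, not part of the paper.

```python
import math

def log2_likelihood(theta, m, errors):
    """log2 of theta**errors * (1 - theta)**(m - errors), the assumed
    label-noise likelihood of the labels under classifier c."""
    return errors * math.log2(theta) + (m - errors) * math.log2(1 - theta)

def binary_entropy(mu):
    """Binary entropy H(mu) in bits, with H(0) = H(1) = 0."""
    if mu in (0.0, 1.0):
        return 0.0
    return -mu * math.log2(mu) - (1 - mu) * math.log2(1 - mu)

m, errors = 200, 30          # hypothetical sample size and mistake count
e_s = errors / m             # empirical error rate e_S(c) = 0.15

# Grid search over theta: the maximizer should sit at theta = e_S(c).
grid = [i / 1000 for i in range(1, 1000)]
best_theta = max(grid, key=lambda t: log2_likelihood(t, m, errors))

# At the maximum, the negative log-likelihood equals m * H(e_S(c)).
neg_log_lik = -log2_likelihood(e_s, m, errors)
print(best_theta, neg_log_lik, m * binary_entropy(e_s))
```

The agreement of the last two printed numbers is exactly the identity (2) for this choice of $m$ and $k$.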

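Bayes rule (3) is formed into a classifier learning algorithm by choosing the
most likely label given the input $x$ and the posterior $P(\cdot \mid S)$:

$$c_{\mathrm{BAYES}(P,S)}(x) := \begin{cases} 1 & \text{if } E_{c,\theta \sim P(\cdot \mid S)}\, p_{c,\theta}(Y = 1 \mid X = x) \ge \tfrac{1}{2}, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$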



The MAP classification algorithm The integrations of the full Bayesian
classifier can be too computationally intensive, so we sometimes predict using
the Bayesian Maximum A Posteriori (MAP) classifier. This classifier is given by:

$$c_{\mathrm{MAP}(P,S)} = \arg\max_{c \in \mathcal{C}} \max_{\theta \in [0,1]} P(c, \theta \mid S) = \arg\max_{c \in \mathcal{C}} \max_{\theta \in [0,1]} p_{c,\theta}(y^m \mid x^m)\, P(c, \theta)$$

with ties broken arbitrarily. Integration over $\theta \in [0, 1]$ being much less problematic
than summation over $c \in \mathcal{C}$, one sometimes uses a learning algorithm which
integrates over $\theta$ (like full Bayes) but maximizes over $c$ (like MAP):

$$c_{\mathrm{SMAP}(P,S)} = \arg\max_{c \in \mathcal{C}} P(c \mid S) = \arg\max_{c \in \mathcal{C}} E_{\theta \sim P(\theta)}\, p_{c,\theta}(y^m \mid x^m)\, P(c \mid \theta).$$



The MDL Classification algorithm The MDL approach to classification is
transplanted from the MDL approach to density estimation. There is no such
thing as a 'definition' of MDL for classification because the transplant has been
performed in various ways by various authors. Nonetheless, as we discuss in Section 7,
most implementations are essentially equivalent to the following algorithm
[20, 21, 15, 12]:

$$c_{\mathrm{MDL}(P,S)} = \arg\min_{c \in \mathcal{C}} \left\{ \log \frac{1}{P(c)} + \log \binom{m}{m\, e_S(c)} \right\}. \qquad (5)$$

The quantity minimized has a coding interpretation: it is the number of bits
required to describe the classifier plus the number of bits required to describe
the labels on $S$ given the classifier and the unlabeled data. We call
$-\log P(c) + \log \binom{m}{m\, e_S(c)}$ the two-part MDL codelength for encoding data $S$ with classifier $c$.
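
The two-part codelength in (5) is easy to evaluate directly. The following sketch assumes base-2 logarithms and a finite list of candidate classifiers, each summarized by a prior mass and a number of mistakes on the sample; the candidate values are made up for illustration only. It shows how a classifier with a tiny prior but zero empirical error can still obtain the shorter code.

```python
import math

def two_part_codelength(prior_c, m, errors):
    """-log2 P(c) + log2 binom(m, errors), in bits: the cost of naming the
    classifier plus the cost of naming which labels it gets wrong."""
    return -math.log2(prior_c) + math.log2(math.comb(m, errors))

def mdl_classifier(candidates, m):
    """candidates: list of (name, prior, errors) tuples (hypothetical data).
    Returns the name of the candidate with the shortest two-part code."""
    return min(candidates, key=lambda c: two_part_codelength(c[1], m, c[2]))[0]

m = 100
candidates = [
    ("good", 0.5, 20),        # large prior, 20 mistakes on the sample
    ("bad", 2.0 ** -40, 0),   # tiny prior, but fits the sample perfectly
]
for name, prior, errors in candidates:
    print(name, round(two_part_codelength(prior, m, errors), 2))
print(mdl_classifier(candidates, m))
```

In this made-up instance the 'good' classifier costs about 70 bits (1 bit for the prior plus roughly 69 bits for the 20 mistakes), while the 'bad' one costs 40 bits, so the minimizer of (5) is the classifier that happens to fit the sample perfectly.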

3 Main Theorems 

In this section we state the basic inconsistency theorem. We prove inconsistency
for some countable set of classifiers $\mathcal{C} = \{c_0, c_1, \ldots\}$ which we define later. The
inconsistency is attained for priors with 'heavy tails', satisfying

$$\log \frac{1}{P(c_k)} \le \log k + o(\log k). \qquad (6)$$

This condition is satisfied by, for example, Rissanen's universal prior for the
integers [21]. The sensitivity of our result to the choice of prior is analyzed
further in Section 5.1. The prior on $\theta$ can be any distribution on $[0, 1]$ with a
continuously differentiable density $P$ bounded away from 0, i.e. for some $\gamma > 0$,

$$\text{for all } \theta \in [0, 1]:\quad P(\theta) > \gamma. \qquad (7)$$

For example, we may take the uniform distribution with $P(\theta) \equiv 1$. We assume
that the prior $P(\theta)$ on $[0, 1]$ and the prior $P(c)$ on $\mathcal{C}$ are independent, so that
$P(c, \theta) = P(c) P(\theta)$. In the theorem, $H(\mu) = -\mu \log \mu - (1 - \mu) \log(1 - \mu)$ stands
for the binary entropy of a coin with bias $\mu$.

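Theorem 2. (Classification Inconsistency) There exists an input space $\mathcal{X}$
and a countable set of classifiers $\mathcal{C}$ such that the following holds: let $P$ be any
prior satisfying (6) and (7). For all $\mu \in (0, 0.5)$ and all $\mu' \in [\mu, H(\mu)/2)$, there
exists a $D$ with $\min_{c \in \mathcal{C}} e_D(c) = \mu$ such that, for all large $m$, all $\delta > 0$,

$$\Pr_{S \sim D^m} \left( e_D(c_{\mathrm{SMAP}(P,S)}) = \mu' \right) \ge 1 - a_m,$$
$$\Pr_{S \sim D^m} \left( e_D(c_{\mathrm{MDL}(P,S)}) = \mu' \right) \ge 1 - a_m,$$
$$\Pr_{S \sim D^m} \left( e_D(c_{\mathrm{BAYES}(P,S)}) \ge \mu' - \delta \right) \ge 1 - a_m, \quad \text{where } a_m = 3 \exp(-2\sqrt{m}).$$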



The theorem states that Bayes is inconsistent for all large $m$ on a fixed distribution
$D$. This is a significantly more difficult statement than "for all (large) $m$,
there exists a learning problem where Bayes is inconsistent" [†]. Differentiation of
$0.5 H(\mu) - \mu$ shows that the maximum discrepancy between $e_D(c_{\mathrm{MAP}(P,S)})$ and $\mu$
is achieved for $\mu = 1/5$. With this choice of $\mu$, $0.5 H(\mu) - \mu = 0.1609\ldots$ so that,
by choosing $\mu'$ arbitrarily close to $H(\mu)/2$, the discrepancy $\mu' - \mu$ comes arbitrarily
close to $0.1609\ldots$. These findings are summarized in Figure 1.
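
The value $0.1609\ldots$ can be reproduced numerically. With base-2 logarithms (an assumption, made because code lengths are measured in bits), the derivative of $0.5 H(\mu) - \mu$ is $\tfrac{1}{2} \log_2 \frac{1 - \mu}{\mu} - 1$, which vanishes when $(1 - \mu)/\mu = 4$, i.e. at $\mu = 1/5$. The grid search below is only an illustrative check of that calculation; the function names are not from the paper.

```python
import math

def binary_entropy(mu):
    """Binary entropy H(mu) in bits, for 0 < mu < 1."""
    return -mu * math.log2(mu) - (1 - mu) * math.log2(1 - mu)

def discrepancy(mu):
    """Supremum of mu' - mu allowed by Theorem 2, i.e. H(mu)/2 - mu."""
    return 0.5 * binary_entropy(mu) - mu

# Search mu over a fine grid inside (0, 0.5).
grid = [i / 10000 for i in range(1, 5000)]
best_mu = max(grid, key=discrepancy)
print(best_mu, discrepancy(best_mu))
```

The search returns a maximizer at $\mu = 0.2$ with value about $0.16096$, matching the figure quoted in the text.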

How large can the discrepancy between $\mu = \inf_c e_D(c)$ and $\mu' = e_D(c_{\mathrm{BAYES}(P,S)})$
be in the large $m$ limit, for general learning problems? Our next theorem, again
summarized in Figure 1, gives an upper bound, namely, $\mu' \le H(\mu)$:

Theorem 3. (Maximal Inconsistency of Bayes) Let $S^i$ be the sequence
consisting of the first $i$ examples $(x_1, y_1), \ldots, (x_i, y_i)$. For all priors $P$ nonzero
on a set of classifiers $\mathcal{C}$, for all learning problems $D$ with $\inf_{c \in \mathcal{C}} e_D(c) = \mu$, for
all $\delta > 0$, for all large $m$, with $D^m$-probability $\ge 1 - \exp(-2\sqrt{m})$,

$$\frac{1}{m} \sum_{i=1}^{m} \left| y_i - c_{\mathrm{BAYES}(P,S^{i-1})}(x_i) \right| \le H(\mu) + \delta.$$

The theorem says that for large $m$, the total number of mistakes when successively
classifying $y_i$ given $x_i$ made by the Bayesian algorithm based on $S^{i-1}$,
divided by $m$, is not larger than $H(\mu)$. By the law of large numbers, it follows that
for large $m$, $e_D(c_{\mathrm{BAYES}(P,S^{i-1})}(x_i))$, averaged over all $i$, is no larger than $H(\mu)$.
Thus, it is not ruled out that sporadically, for some $i$, $e_D(c_{\mathrm{BAYES}(P,S^{i-1})}(x_i)) > H(\mu)$;
but this must be 'compensated' for by most other $i$. We did not find a
proof that $e_D(c_{\mathrm{BAYES}(P,S^{i-1})}(x_i)) \le H(\mu)$ for all large $i$.

4 Proofs 

In this section we present the proofs of our three theorems. Theorems 2 and 3
both make use of the following lemma:

Lemma 1. There exists $\gamma > 0$ such that for all classifiers $c$, $\alpha > 0$, $m > 0$, all
$S \sim D^m$ satisfying $\alpha + 1/\sqrt{m} < e_S(c) < 0.5$, all priors satisfying (7):



Ice < log < 

P(y'" I a;'",c,es(c)) ~ P(y'" | x'",c) ~ 

F(y"^ \ x™',c,es(c)) 2 2 a(l — a) 

Proof (sketch). For the first inequality, note

$$\log \frac{1}{P(y^m \mid x^m, c)} = \log \frac{1}{\int P(y^m \mid x^m, c, \theta)\, P(\theta)\, d\theta} \ge \log \frac{1}{P(y^m \mid x^m, c, e_S(c))},$$

since the likelihood $P(y^m \mid x^m, c, \theta)$ is maximized at $\theta = e_S(c)$. For the
second inequality, note that

$$\int_0^1 P(y^m \mid x^m, c, \theta)\, P(\theta)\, d\theta \ge \int_{e_S(c) - 1/\sqrt{m}}^{e_S(c) + 1/\sqrt{m}} \exp\left( \log P(y^m \mid x^m, c, \theta) + \log P(\theta) \right) d\theta.$$

We obtain (8) by expanding $\log P(y^m \mid x^m, c, \theta)$ around the maximum $\theta = e_S(c)$
using a second-order Taylor approximation. See [2] for further details.

[†] In fact, a meta-argument can be made that any nontrivial learning algorithm is
'inconsistent' in this sense for finite $m$.

Fig. 1. A graph depicting the set of asymptotically allowed error rates for different
classification algorithms. The x-axis depicts the optimal classifier's error rate $\mu$ (also
shown as the straight line). The lower curve is just $0.5 H(\mu)$ and the upper curve is
$H(\mu)$. Theorem 2 says that any $(\mu, \mu')$ between the straight line and the lower curve
can be achieved for some learning problem $D$ and prior $P$. Theorem 3 shows that the
Bayesian learner can never have asymptotic error rate $\mu'$ above the upper curve.



4.1 Inconsistent Learning Algorithms: Proof of Theorem 2 

Below we first define the particular learning problem that causes inconsistency. 
We then analyze the performance of the algorithms on this learning problem. 



The Learning Problem For given $\mu$ and $\mu' > \mu$, we construct a learning
problem and a set of classifiers $\mathcal{C} = \{c_0, c_1, \ldots\}$ such that $c_0$ is the 'good' classifier
with $e_D(c_0) = \mu$ and $c_1, c_2, \ldots$ are all 'bad' classifiers with $e_D(c_j) = \mu' > \mu$. $\mathcal{X}$
consists of one binary feature per classifier [‡], and the classifiers simply output
the value of their special feature. The underlying distribution $D$ is constructed
in terms of $\mu$ and $\mu'$ and a proof parameter $\mu_{\mathrm{hard}} \ge \tfrac{1}{2}$ (the error rate for "hard"
examples). To construct an example $(x, y)$, we first flip a fair coin to determine
$y$, so $y = 1$ with probability 1/2. We then flip a coin with bias $p_{\mathrm{hard}} := \mu' / \mu_{\mathrm{hard}}$
which determines if this is a "hard" example or an "easy" example. Based upon
these two coin flips, each $x_j$ is independently generated based on the following
3 cases.

1. For a "hard" example, and for each classifier $c_j$ with $j \ge 1$, set $x_j = |1 - y|$
with probability $\mu_{\mathrm{hard}}$ and $x_j = y$ otherwise.

2. For an "easy" example, and every $j \ge 1$, set $x_j = y$.

3. For the "good" classifier $c_0$ (with true error rate $\mu$), set $x_0 = |1 - y|$ with
probability $\mu$ and $x_0 = y$ otherwise.

The error rates of each classifier are $e_D(c_0) = \mu$ and $e_D(c_j) = \mu'$ for all $j \ge 1$:
a bad classifier errs only on hard examples, so its error rate is $p_{\mathrm{hard}} \cdot \mu_{\mathrm{hard}} = \mu'$.
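
The construction can be simulated to confirm the stated error rates. The sketch below truncates the countably infinite feature space to `n_bad` bad classifiers, which is an assumption made purely so that the example is finite; the function names and parameter values are hypothetical.

```python
import random

def draw_example(mu, mu_prime, mu_hard, n_bad, rng):
    """Draw one (x, y) from the learning problem, restricted to the good
    classifier's feature x[0] and the first n_bad bad features x[1:]."""
    y = rng.randint(0, 1)                           # fair coin for the label
    hard = rng.random() < mu_prime / mu_hard        # coin with bias p_hard
    x0 = 1 - y if rng.random() < mu else y          # case 3: good feature
    bad = []
    for _ in range(n_bad):
        if hard and rng.random() < mu_hard:         # case 1: hard example
            bad.append(1 - y)
        else:                                       # case 2, or no flip
            bad.append(y)
    return [x0] + bad, y

def empirical_errors(m, mu, mu_prime, mu_hard, n_bad, seed=0):
    """Empirical error rate of every classifier c_j(x) = x[j] on m draws."""
    rng = random.Random(seed)
    errors = [0] * (n_bad + 1)
    for _ in range(m):
        x, y = draw_example(mu, mu_prime, mu_hard, n_bad, rng)
        for j, xj in enumerate(x):
            errors[j] += (xj != y)
    return [e / m for e in errors]

if __name__ == "__main__":
    # Good classifier near mu = 0.1, bad classifiers near mu' = 0.2.
    print(empirical_errors(20000, 0.1, 0.2, 0.5, 3))
```

On a large sample every bad classifier sits near $\mu'$; the inconsistency argument instead exploits small-probability events in which one of very many bad classifiers makes no mistakes at all.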

Bayes and MDL are inconsistent We now prove Theorem 2. In Stage 1 we
show that there exists a $k_m$ such that for every value of $m$, with probability
converging to 1, there exists some 'bad' classifier $c_j$ with $0 < j \le k_m$ that
has 0 empirical error. In Stage 2 we show that the prior of this classifier is
large enough so that its posterior is exponentially larger than that of the good
classifier $c_0$, showing the convergence $e_D(c_{\mathrm{MAP}(P,S)}) \to \mu'$. In Stage 3 we sketch
the convergences $e_D(c_{\mathrm{SMAP}(P,S)}) \to \mu'$, $e_D(c_{\mathrm{MDL}(P,S)}) \to \mu'$, $e_D(c_{\mathrm{BAYES}(P,S)}) \to \mu'$.

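Stage 1 Let $m_{\mathrm{hard}}$ denote the number of hard examples generated within a
sample $S$ of size $m$. Let $k$ be a positive integer and $\mathcal{C}_k = \{c_j \in \mathcal{C} : 1 \le j \le k\}$.
For all $\epsilon > 0$ and $m \ge 0$, we have:

$$\begin{aligned}
\Pr_{S \sim D^m} & \left( \forall c \in \mathcal{C}_k : e_S(c) > 0 \right) \\
&\overset{(a)}{=} \Pr_{S \sim D^m} \left( \forall c \in \mathcal{C}_k : e_S(c) > 0 \,\middle|\, \tfrac{m_{\mathrm{hard}}}{m} > p_{\mathrm{hard}} + \epsilon \right) \Pr_{S \sim D^m} \left( \tfrac{m_{\mathrm{hard}}}{m} > p_{\mathrm{hard}} + \epsilon \right) \\
&\quad + \Pr_{S \sim D^m} \left( \forall c \in \mathcal{C}_k : e_S(c) > 0 \,\middle|\, \tfrac{m_{\mathrm{hard}}}{m} \le p_{\mathrm{hard}} + \epsilon \right) \Pr_{S \sim D^m} \left( \tfrac{m_{\mathrm{hard}}}{m} \le p_{\mathrm{hard}} + \epsilon \right) \\
&\overset{(b)}{\le} e^{-2m\epsilon^2} + \Pr_{S \sim D^m} \left( \forall c \in \mathcal{C}_k : e_S(c) > 0 \,\middle|\, \tfrac{m_{\mathrm{hard}}}{m} \le p_{\mathrm{hard}} + \epsilon \right) \\
&\overset{(c)}{\le} e^{-2m\epsilon^2} + \left( 1 - (1 - \mu_{\mathrm{hard}})^{m(p_{\mathrm{hard}} + \epsilon)} \right)^k \\
&\overset{(d)}{\le} e^{-2m\epsilon^2} + e^{-k (1 - \mu_{\mathrm{hard}})^{m(p_{\mathrm{hard}} + \epsilon)}}. \qquad (9)
\end{aligned}$$

Here (a) follows because $P(a) = \sum_b P(a \mid b) P(b)$. (b) follows by $\forall a, P : P(a) \le 1$
and the Chernoff bound. (c) holds since $(1 - (1 - \mu_{\mathrm{hard}})^{m(p_{\mathrm{hard}} + \epsilon)})^k$ is monotonic
in $\epsilon$, and (d) by $\forall x \in [0, 1], k > 0 : (1 - x)^k \le e^{-kx}$. We now set $\epsilon_m := m^{-0.25}$
and

$$k(m) := \frac{2\sqrt{m}}{(1 - \mu_{\mathrm{hard}})^{m(p_{\mathrm{hard}} + \epsilon_m)}}.$$

Then (9) becomes

$$\Pr_{S \sim D^m} \left( \forall c \in \mathcal{C}_{k(m)} : e_S(c) > 0 \right) \le 2 e^{-2\sqrt{m}}. \qquad (10)$$

On the other hand, by the Chernoff bound we have
$\Pr_{S \sim D^m} \left( e_S(c_0) < e_D(c_0) - \epsilon_m \right) \le e^{-2\sqrt{m}}$ for the optimal classifier $c_0$.
Combining this with (10) using the union bound, we get that, with
$D^m$-probability larger than $1 - 3 e^{-2\sqrt{m}}$, the following event holds:

$$\exists c \in \mathcal{C}_{k(m)} : e_S(c) = 0 \quad \text{and} \quad e_S(c_0) \ge e_D(c_0) - \epsilon_m. \qquad (11)$$

[‡] This input space has a countably infinite size. The Bayesian posterior is still
computable for any finite $m$ if we order the features according to the prior of the
associated classifier. We need only consider features which have an associated prior
greater than $2^{-m}$ since the minus log-likelihood of the data is always less than $m$
bits. Alternatively, we could use stochastic classifiers and a very small input space.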
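
To get a feel for the quantities chosen in Stage 1, the sketch below evaluates $\epsilon_m$, $\log_2 k(m)$ and the two exponential terms for a concrete $m$. It assumes base-2 logarithms and the later choice $\mu_{\mathrm{hard}} = 1/2$ as the default; the function name and the sample values are illustrative only.

```python
import math

def stage1_quantities(m, mu_prime, mu_hard=0.5):
    """Return (log2 k(m), Chernoff term, right-hand side of (10)) for
    eps_m = m**-0.25 and k(m) = 2*sqrt(m) / (1 - mu_hard)**(m*(p_hard + eps_m))."""
    p_hard = mu_prime / mu_hard
    eps = m ** -0.25
    # m * eps_m equals m**0.75, so the exponent is m*p_hard + m**0.75.
    log2_k = (math.log2(2 * math.sqrt(m))
              - (m * p_hard + m ** 0.75) * math.log2(1 - mu_hard))
    chernoff = math.exp(-2 * m * eps ** 2)   # equals exp(-2 * sqrt(m))
    bound = 2 * math.exp(-2 * math.sqrt(m))  # right-hand side of (10)
    return log2_k, chernoff, bound

log2_k, chernoff, bound = stage1_quantities(10000, 0.2)
print(log2_k, chernoff, bound)
```

For $m = 10000$ and $\mu' = 0.2$ this gives $\log_2 k(m) \approx 5007.6$, dominated by the linear term $2 m \mu' = 4000$ plus $m^{0.75} = 1000$: the number of bad classifiers that must be considered grows exponentially in $m$, which is why the heavy-tail condition (6) on the prior matters in Stage 2.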

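Stage 2 In the following derivation, we assume that the large probability event
(11) holds. We show that this implies that for large $m$, the posterior on some
$c^* \in \mathcal{C}_{k(m)}$ with $e_S(c^*) = 0$ is greater than the posterior on $c_0$, which implies
that the MAP algorithm is inconsistent. Taking the log of the posterior ratios,
we get:

$$\begin{aligned}
\log \frac{\max_\theta P(c_0, \theta \mid x^m, y^m)}{\max_\theta P(c^*, \theta \mid x^m, y^m)}
&= \log \frac{\max_\theta P(c_0) P(\theta) P(y^m \mid x^m, c_0, \theta)}{\max_\theta P(c^*) P(\theta) P(y^m \mid x^m, c^*, \theta)} \\
&= \log \max_\theta P(c_0) P(\theta) P(y^m \mid x^m, c_0, \theta) - \log \max_\theta P(c^*) P(\theta) P(y^m \mid x^m, c^*, \theta). \qquad (12)
\end{aligned}$$

Using (2) we see that the leftmost term is no larger than

$$\begin{aligned}
\log \left( \max_\theta P(c_0) P(\theta) \right) \cdot \left( \max_{\theta'} P(y^m \mid x^m, c_0, \theta') \right)
&= -m H(e_S(c_0)) + O(1) \\
&\le -m H(e_D(c_0)) - K m \epsilon_m + O(1) = -m H(\mu) - m^{0.75} K + O(1), \qquad (13)
\end{aligned}$$

where $K$ is some constant. The last line follows because $H(\mu)$ is continuously
differentiable in a small enough neighborhood around $\mu$.

For the rightmost term in (12), by the condition (7) on the prior $P(\theta)$, and
because $e_S(c^*) = 0$ so that the likelihood of $c^*$ is 1 at $\theta = 0$,

$$-\log \max_\theta P(c^*) P(\theta) P(y^m \mid x^m, c^*, \theta) \le -\log P(c^*) - \log \gamma. \qquad (14)$$

Using condition (6) on the prior $P(c^*)$ and using $c^* \in \mathcal{C}_{k(m)}$, we find:

$$\log \frac{1}{P(c^*)} \le \log k(m) + o(\log k(m)), \qquad (15)$$

where $\log k(m) = \log 2\sqrt{m} - (m\, p_{\mathrm{hard}} + m^{0.75}) \log(1 - \mu_{\mathrm{hard}})$. Choosing $\mu_{\mathrm{hard}} = 1/2$,
this becomes $\log k(m) = \tfrac{1}{2} \log m + 2 m \mu' + m^{0.75} + O(1)$. Combining this
with (15), we find that

$$\log \frac{1}{P(c^*)} \le 2 m \mu' + o(m), \qquad (16)$$

which implies that (14) is no larger than $2 m \mu' + o(m)$. Adding the bound (13) on
the leftmost term of (12) to the bound (14) on its negated rightmost term shows that
(12) is at most $-m H(\mu) + 2 m \mu' + o(m)$. Since $\mu' < H(\mu)/2$, this is
less than 0 for large $m$, implying that then $e_D(c_{\mathrm{MAP}(P,S)}) = \mu'$. We derived all
this from (11) which holds with probability $\ge 1 - 3 \exp(-2\sqrt{m})$. Thus, for all
large $m$, $\Pr_{S \sim D^m} \left( e_D(c_{\mathrm{MAP}(P,S)}) = \mu' \right) \ge 1 - 3 \exp(-2\sqrt{m})$, and the result follows.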



Stage 3 (sketch) The proof that the integrated MAP classifier $c_{\mathrm{SMAP}(P,S)}$ is
inconsistent is similar to the proof for $c_{\mathrm{MAP}(P,S)}$ that we just gave, except that
(12) now becomes

$$\log P(c_0) P(y^m \mid x^m, c_0) - \log P(c^*) P(y^m \mid x^m, c^*). \qquad (17)$$

By Lemma 1 we see that, if (11) holds, the difference between (12) and (17) is
of order $O(\log m)$. The proof then proceeds exactly as for the MAP case.

To prove inconsistency of $c_{\mathrm{MDL}(P,S)}$, note that the MDL code length of $y^m$
given $x^m$ according to $c_0$ is given by $\log \binom{m}{m\, e_S(c_0)}$. If (11) holds, then a simple
Stirling's approximation as in [12] or [15] shows that
$\log \binom{m}{m\, e_S(c_0)} = m H(e_S(c_0)) - O(\log m)$. Thus, the difference between
the two-part codelengths achieved by $c_0$ and $c^*$ is given by

$$-m H(e_S(c_0)) + O(\log m) - \log P(c^*). \qquad (18)$$

The proof then proceeds as for the MAP case, with (12) replaced by (18) and a
few immediate adjustments.

To prove inconsistency of $c_{\mathrm{BAYES}(P,S)}$, we take $\mu_{\mathrm{hard}}$ not equal to 1/2 but to
$1/2 + \delta$ for some small $\delta > 0$. By taking $\delta$ small enough, the proof for $c_{\mathrm{MAP}(P,S)}$
above goes through unchanged so that, with probability $\ge 1 - 3 \exp(-2\sqrt{m})$, the
Bayesian posterior puts all its weight, except for an exponentially small part, on
a mixture of distributions $p_{c,\theta}$ whose Bayes classifier has error rate $\mu'$ and error
rate on hard examples $> 1/2$. It can be shown that this implies that for large
$m$, the classification error of $c_{\mathrm{BAYES}(P,S)}$ converges to $\mu'$; we omit details.

4.2 A Consistent Algorithm: Proof of Theorem 1 

In order to prove the theorem, we first state the Occam's Razor Bound classification
algorithm, based on minimizing the bound given by the following theorem.

Theorem 4. (Occam's Razor Bound) [7] For all priors $P$ on a countable set of
classifiers $\mathcal{C}$, for all distributions $D$, with probability $1 - \delta$:

$$\forall c : \quad e_D(c) \le e_S(c) + \sqrt{\frac{\ln \frac{1}{P(c)} + \ln \frac{1}{\delta}}{2m}}.$$

We state the algorithm here in a suboptimal form, which is good enough for our
purposes (see [18] for more sophisticated versions):

$$c_{\mathrm{ORB}(P,S)} := \arg\min_{c \in \mathcal{C}} \left\{ e_S(c) + \sqrt{\frac{\ln \frac{1}{P(c)} + \ln m}{2m}} \right\}.$$
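
A minimal sketch of the ORB rule follows, assuming a finite list of candidate classifiers summarized by prior mass and empirical error rate; the candidates and their numbers are invented for illustration and the function names are not from [7] or [18].

```python
import math

def orb_objective(emp_error, prior_c, m):
    """e_S(c) + sqrt((ln(1/P(c)) + ln m) / (2m)): the Occam's Razor bound
    with delta set to 1/m, as in the suboptimal form stated above."""
    return emp_error + math.sqrt((math.log(1 / prior_c) + math.log(m)) / (2 * m))

def orb_classifier(candidates, m):
    """candidates: list of (name, prior, empirical_error) tuples
    (hypothetical data). Returns the name minimizing the bound."""
    return min(candidates, key=lambda c: orb_objective(c[2], c[1], m))[0]

m = 100
candidates = [
    ("good", 0.5, 0.20),       # large prior, 20% empirical error
    ("bad", 2.0 ** -40, 0.0),  # tiny prior, zero empirical error
]
for name, prior, err in candidates:
    print(name, round(orb_objective(err, prior, m), 4))
print(orb_classifier(candidates, m))
```

Here the complexity penalty enters under a square root and is divided by $m$, so a classifier with prior $2^{-40}$ pays about 0.40 in the bound even with zero empirical error, more than the 0.36 of the 'good' classifier. A two-part code of the form (5) would instead charge 40 bits for the prior against roughly 70 bits for the good classifier, and so would prefer the perfectly fitting one.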

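Proof of Theorem 1 Set $\delta_m := 1/m$. It is easy to see that

$$\min_{c \in \mathcal{C}} \left\{ e_D(c) + \sqrt{\frac{\ln \frac{1}{P(c)} + \ln m}{2m}} \right\}$$

is achieved for at least one $c \in \mathcal{C} = \{c_0, c_1, \ldots\}$. Among all $c_j \in \mathcal{C}$ achieving the
minimum, let $\tilde{c}_m$ be the one with smallest index $j$. By the Chernoff bound, we
have with probability at least $1 - \delta_m = 1 - 1/m$,

$$e_D(\tilde{c}_m) \ge e_S(\tilde{c}_m) - \sqrt{\frac{\ln (1/\delta_m)}{2m}} = e_S(\tilde{c}_m) - \sqrt{\frac{\ln m}{2m}}, \qquad (19)$$

whereas by Theorem 4, with probability at least $1 - \delta_m = 1 - 1/m$,

$$e_D(c_{\mathrm{ORB}(P,S)}) \le \min_{c \in \mathcal{C}} \left\{ e_S(c) + \sqrt{\frac{-\ln P(c) + \ln m}{2m}} \right\} \le e_S(\tilde{c}_m) + \sqrt{\frac{-\ln P(\tilde{c}_m) + \ln m}{2m}}.$$

Combining this with (19) using the union bound, we find that

$$e_D(c_{\mathrm{ORB}(P,S)}) \le e_D(\tilde{c}_m) + \sqrt{\frac{-\ln P(\tilde{c}_m) + \ln m}{2m}} + \sqrt{\frac{\ln m}{2m}},$$

with probability at least $1 - 2/m$. The theorem follows upon noting that the
right-hand side of this expression converges to $\inf_{c \in \mathcal{C}} e_D(c)$ with increasing $m$.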

4.3 Proof of Theorem 3 

Without loss of generality assume that $c_0$ achieves $\min_{c \in \mathcal{C}} e_D(c)$. Consider both
the 0/1-loss and the log loss of sequentially predicting with the Bayes predictive
distribution $P(Y_i = \cdot \mid X_i = \cdot, S^{i-1})$ given by
$P(y_i \mid x_i, S^{i-1}) = E_{c,\theta \sim P(\cdot \mid S^{i-1})}\, p_{c,\theta}(y_i \mid x_i)$.
Every time $i \in \{1, \ldots, m\}$ that the Bayes classifier
based on $S^{i-1}$ classifies $y_i$ incorrectly, $P(y_i \mid x_i, S^{i-1})$ must be $\le 1/2$ so that
$-\log P(y_i \mid x_i, S^{i-1}) \ge 1$. Therefore,

$$\sum_{i=1}^{m} -\log P(y_i \mid x_i, S^{i-1}) \ge \sum_{i=1}^{m} \left| y_i - c_{\mathrm{BAYES}(P,S^{i-1})}(x_i) \right|. \qquad (20)$$
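
Inequality (20) holds for any sequence of predictive probabilities, not only Bayesian ones, as long as the predicted label is the one with probability at least 1/2 and logarithms are taken to base 2. The sketch below checks this on a short made-up sequence; the function name and the numbers are illustrative assumptions.

```python
import math

def mistakes_and_log_loss(probs_of_one, labels):
    """Count 0/1 mistakes of the rule 'predict 1 iff P(y=1) >= 1/2' and
    accumulate the log loss -log2 P(true label) along the sequence."""
    mistakes, log_loss = 0, 0.0
    for p1, y in zip(probs_of_one, labels):
        prediction = 1 if p1 >= 0.5 else 0
        mistakes += (prediction != y)
        p_true = p1 if y == 1 else 1 - p1   # probability given to the truth
        log_loss += -math.log2(p_true)
    return mistakes, log_loss

# Hypothetical predictive probabilities P(y_i = 1 | ...) and true labels.
probs = [0.9, 0.4, 0.5, 0.2]
labels = [1, 1, 0, 0]
mistakes, log_loss = mistakes_and_log_loss(probs, labels)
print(mistakes, log_loss)
```

Each mistake means the true label received probability at most 1/2 and therefore contributes at least one bit of log loss, while correct predictions contribute a nonnegative amount; summing gives (20).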

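On the other hand we have

$$\begin{aligned}
\sum_{i=1}^{m} -\log P(y_i \mid x_i, S^{i-1})
&= -\log \prod_{i=1}^{m} P(y_i \mid x_i, x^{i-1}, y^{i-1})
 = -\log \prod_{i=1}^{m} P(y_i \mid x^m, y^{i-1}) \\
&= -\log \prod_{i=1}^{m} \frac{P(y^i \mid x^m)}{P(y^{i-1} \mid x^m)}
 = -\log P(y^m \mid x^m) \\
&= -\log \sum_{j = 0, 1, 2, \ldots} P(y^m \mid x^m, c_j) P(c_j)
 \le -\log P(y^m \mid x^m, c_0) - \log P(c_0), \qquad (21)
\end{aligned}$$

where the inequality follows because a sum is larger than each of its terms. By
the Chernoff bound, for all small enough $\epsilon > 0$, with probability larger than
$1 - 2 \exp(-2 m \epsilon^2)$, we have $|e_S(c_0) - e_D(c_0)| < \epsilon$. We now set $\epsilon_m = m^{-0.25}$.
Then, using Lemma 1, with probability larger than $1 - 2 \exp(-2\sqrt{m})$, for all
large $m$ (21) is less than or equal to

$$\begin{aligned}
-\log P(y^m \mid x^m, c_0, e_S(c_0)) + \tfrac{1}{2} \log m + C_m
&\overset{(a)}{=} m H(e_S(c_0)) + \tfrac{1}{2} \log m + C_m \\
&\overset{(b)}{\le} m H(e_D(c_0)) + K m^{0.75} + \tfrac{1}{2} \log m + C_m, \qquad (22)
\end{aligned}$$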



where Cm = (eD(co) - e„ - m "■''^) ^(1 - eoico) + e,; 

constant not depending on $S = S^m$. Here (a) follows from Equation (2) and (b)
follows because $H(\mu)$ is continuously differentiable in a neighborhood of $\mu$.
Combining (22) with (20) and using $C_m = O(1)$ we find that with probability
$\ge 1 - \exp(-2\sqrt{m})$, $\sum_{i=1}^{m} \left| y_i - c_{\mathrm{BAYES}(P,S^{i-1})}(x_i) \right| \le m H(e_D(c_0)) + o(m)$, QED.

5 Technical Discussion 

5.1 Variations of Theorem 2 and dependency on the prior 

Prior on classifiers The requirement (6) that $-\log P(c_k) \le \log k + o(\log k)$ is
needed to obtain (16), which is the key inequality in the proof of Theorem 2. If
$P(c_k)$ decreases at polynomial rate, but at a degree $d$ larger than one, i.e. if

$$-\log P(c_k) = d \log k + o(\log k), \qquad (23)$$

then a variation of Theorem 2 still applies but the maximum possible discrepancies
between $\mu$ and $\mu'$ become much smaller: essentially, if we require
$\mu \le \mu' < \frac{1}{2d} H(\mu)$ rather than $\mu \le \mu' < \frac{1}{2} H(\mu)$ as in Theorem 2, then the
argument works for all priors satisfying (23). Since the derivative $dH(\mu)/d\mu \to \infty$
as $\mu \downarrow 0$, by setting $\mu$ close enough to 0 it is possible to obtain inconsistency for
any fixed polynomial degree of decrease $d$. However, the higher $d$, the smaller
$\mu = \inf_{c \in \mathcal{C}} e_D(c)$ must be to get any inconsistency with our argument.

Prior on error rates Condition (7) on the prior on the error rates is satisfied
for most reasonable priors. Some approaches to applying MDL to classification
problems amount to assuming priors of the form $p(\theta^*) = 1$ for a single $\theta^* \in [0, 1]$
(Section 7). In that case, we can still prove a version of Theorem 2, but the
maximum discrepancy between $\mu$ and $\mu'$ may now be either larger or smaller
than $H(\mu)/2 - \mu$, depending on the choice of $\theta^*$.

5.2 Properties of the transformation from classifiers to distributions 

Optimality and Reliability Assume that the conditional distribution of $y$ given $x$
according to the 'true' underlying distribution $D$ is defined for all $x \in \mathcal{X}$, and let
$p_D(y \mid x)$ denote its mass function. Define $\Delta(p_{c,\theta})$ as the Kullback-Leibler (KL)
divergence [9] between $p_{c,\theta}$ and the 'true' conditional distribution $p_D$:

$$\Delta(p_{c,\theta}) := KL(p_D \,\|\, p_{c,\theta}) = E_{(x,y) \sim D} \left[ -\log p_{c,\theta}(y \mid x) + \log p_D(y \mid x) \right].$$

Proposition 1. Let $\mathcal{C}$ be any set of classifiers, and let $c^* \in \mathcal{C}$ achieve
$\min_{c \in \mathcal{C}} e_D(c) = e_D(c^*)$.

1. If $e_D(c^*) < 1/2$, then
$\min_{c,\theta} \Delta(p_{c,\theta})$ is uniquely achieved for $(c, \theta) = (c^*, e_D(c^*))$.

2. $\min_{c,\theta} \Delta(p_{c,\theta}) = 0$ iff $p_{c^*, e_D(c^*)}$ is 'true', i.e. if
$\forall x, y : p_{c^*, e_D(c^*)}(y \mid x) = p_D(y \mid x)$.

Property 1 follows since for each fixed $c$, $\min_{\theta \in [0,1]} \Delta(p_{c,\theta})$ is uniquely achieved
for $\theta = e_D(c)$ (this follows by differentiation) and satisfies
$\min_\theta \Delta(p_{c,\theta}) = \Delta(p_{c, e_D(c)}) = H(e_D(c)) - K_D$, where $K_D = E[-\log p_D(y \mid x)]$ does not depend
on $c$ or $\theta$, and $H(\mu)$ is monotonically increasing for $\mu < 1/2$. Property 2 follows
from the information inequality [9].

Proposition 1 implies that our transformation is a good candidate for turning
classifiers into probability distributions.

Namely, let $\mathcal{P} = \{p_\alpha : \alpha \in A\}$ be a set of i.i.d. distributions indexed by
parameter set $A$ and let $P(\alpha)$ be a prior on $A$. By the law of large numbers,
for each $\alpha \in A$, $-m^{-1} \log p_\alpha(y^m \mid x^m) P(\alpha)$ converges to $KL(p_D \,\|\, p_\alpha)$, up to an
additive term that does not depend on $\alpha$. By Bayes rule, this
implies that if the class $\mathcal{P}$ is 'small' enough so that the law of large numbers
holds uniformly for all $p_\alpha \in \mathcal{P}$, then for all $\epsilon > 0$, the Bayesian posterior will
concentrate, with probability 1, on the set of distributions in $\mathcal{P}$ within $\epsilon$ of the
$p^* \in \mathcal{P}$ minimizing KL-divergence to $D$. In our case, if $\mathcal{C}$ is 'simple' enough so
that the corresponding $\mathcal{P} = \{p_{c,\theta} : c \in \mathcal{C}, \theta \in [0, 1]\}$ admits uniform convergence
[12], then the Bayesian posterior asymptotically concentrates on the
$p_{c^*,\theta^*} \in \mathcal{P}$ closest to $D$ in KL-divergence. By Proposition 1, this $p_{c^*,\theta^*}$ corresponds
to the $c^* \in \mathcal{C}$ with smallest generalization error rate $e_D(c^*)$ ($p_{c^*,\theta^*}$ is optimal
for 0/1-loss), and to the $\theta^* \in [0, 1]$ with $\theta^* = e_D(c^*)$ ($p_{c^*,\theta^*}$ gives a reliable
impression of its prediction quality). This convergence to an optimal and reliable
$p_{c^*,\theta^*}$ will happen if, for example, $\mathcal{C}$ has finite VC-dimension [12]. We can only
get trouble as in Theorem 2 if we allow $\mathcal{C}$ to be of infinite VC-dimension.

Analogy to Regression In ordinary (real-valued) regression, $\mathcal{Y} = \mathbb{R}$, and one tries
to learn a function $f \in \mathcal{F}$ from the data. Here $\mathcal{F}$ is a set of candidate functions
$\mathcal{X} \to \mathcal{Y}$. In order to apply Bayesian inference to this problem, one assumes a
probability model $\mathcal{P}$ expressing $Y = f(X) + Z$, where $Z$ is independent noise
with mean 0 and variance $\sigma^2$. $\mathcal{P}$ then consists of conditional density functions
$p_{f,\sigma^2}$, one for each $f \in \mathcal{F}$ and $\sigma^2 > 0$. It is well known that if one assumes $Z$
to be normally distributed independently of $X$, then the $p_{f,\sigma^2}$ become Gaussian
densities and the log-likelihood becomes a linear function of the mean squared error
[21]:

$$-\ln p_{f,\sigma^2}(y^n \mid x^n) = \beta_\sigma \sum_{i=1}^{n} (y_i - f(x_i))^2 + n \ln Z(\beta_\sigma), \qquad (24)$$

where we wrote $\beta_\sigma = 1/(2\sigma^2)$ and $Z(\beta) = \int_{y \in \mathcal{Y}} \exp(-\beta y^2)\, dy$. Because least
squares is an intuitive, mathematically well-behaved and easy to perform procedure,
it is often assumed in Bayesian regression that the noise is normally
distributed - even in cases where in reality, it is not [12, 16].

Completely analogously to the Gaussian case, our transformation maps classifiers
$c$ and noise rates $\theta$ to distributions $p_{c,\theta}$ so that the log-likelihood becomes a
linear function of the 0/1-error, since it can be written as:

$$-\ln p_{c,\theta}(y^n \mid x^n) = \beta_\theta \sum_{i=1}^{n} |y_i - c(x_i)| + n \ln Z(\beta_\theta), \qquad (25)$$

where we wrote $\beta_\theta = \ln(1 - \theta) - \ln \theta$ and $Z(\beta) = \sum_{y \in \mathcal{Y}} \exp(-\beta y)$ [12, 19]. Indeed,
the models $\{p_{c,\theta}\}$ are a special case of logistic regression models, which we now
define:

Logistic regression interpretation Let $\mathcal{C}$ be a set of functions $\mathcal{X} \to \mathcal{Y}$, where $\mathcal{Y} \subseteq \mathbb{R}$
($\mathcal{Y}$ does not need to be binary-valued). The corresponding logistic regression
model is the set of conditional distributions $\{p_{c,\beta} : c \in \mathcal{C}; \beta \in \mathbb{R}\}$ of the form

P-A^ ' "^^ ■= 1 + e-M^) ' PcA^\^)-= i + e-M^) - (26) 

This is the standard construction used to convert classifiers with real-valued
output such as support vector machines and neural networks into conditional
distributions [14, 22], so that Bayesian inference can be applied. By setting $\mathcal{C}$
to be a set of $\{0, 1\}$-valued classifiers, and substituting $\beta = \ln(1 - \theta) - \ln \theta$ as
in (25), we see that our construction is a special case of the logistic regression
transformation (26). It may seem that (26) does not treat $y = 1$ and $y = 0$
on equal footing, but this is not so: we can alternatively define a symmetric
version of (26) by defining, for each $c \in \mathcal{C}$, a corresponding $c' : \mathcal{X} \to \{-1, 1\}$,
$c'(x) := 2c(x) - 1$. Then we can set



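$$p_{c',\beta}(1 \mid x) := \frac{e^{\beta c'(x)}}{e^{\beta c'(x)} + e^{-\beta c'(x)}}; \qquad p_{c',\beta}(-1 \mid x) := \frac{e^{-\beta c'(x)}}{e^{\beta c'(x)} + e^{-\beta c'(x)}}. \tag{27}$$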



By setting $\beta' = 2\beta$ we see that $p_{c,\beta}$ as in (26) is identical to $p_{c',\beta'}$ as in (27), so 
that the two models really coincide. 
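
Returning to (25), the linearity in the 0/1-error can be verified on a small example. This sketch assumes that $p_{c,\theta}(y \mid x)$ equals $\theta$ when $y \neq c(x)$ and $1-\theta$ otherwise, which is the reading suggested by the "noise rate" description; the labels and classifier outputs below are hypothetical.

```python
import math

def p_c_theta(y, c_x, theta):
    """p_{c,theta}(y | x): the label is c(x), flipped with probability theta."""
    return theta if y != c_x else 1.0 - theta

def neg_log_lik(ys, cs, theta):
    """Direct negative log-likelihood of the labels under p_{c,theta}."""
    return -sum(math.log(p_c_theta(y, c, theta)) for y, c in zip(ys, cs))

def linear_form(ys, cs, theta):
    """Right-hand side of (25): beta * number of errors + n ln Z(beta)."""
    beta = math.log(1 - theta) - math.log(theta)
    z = sum(math.exp(-beta * y) for y in (0, 1))  # Z(beta) over Y = {0, 1}
    errors = sum(abs(y - c) for y, c in zip(ys, cs))
    return beta * errors + len(ys) * math.log(z)

ys = [0, 1, 1, 0, 1, 0, 0, 1]  # hypothetical labels
cs = [0, 1, 0, 0, 1, 1, 0, 1]  # hypothetical classifier outputs c(x_i)
theta = 0.2

lhs = neg_log_lik(ys, cs, theta)
rhs = linear_form(ys, cs, theta)
print(abs(lhs - rhs))  # difference is at floating-point rounding level
```

For $\theta < 1/2$ the coefficient $\beta_\theta$ is positive, so at fixed $\theta$ a classifier with fewer training errors always has higher likelihood.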



6 Interpretation from a Bayesian Perspective 

Bayesian Consistency. It is well-known that Bayesian inference is strongly 
consistent under very broad conditions. For example, when applied to our setting, 
the celebrated Blackwell-Dubins consistency theorem [6] says the following. Let 
$\mathcal{C}$ be countable and suppose $D$ is such that, for some $c^* \in \mathcal{C}$ and $\theta^* \in [0,1]$, 
$p_{c^*,\theta^*}$ is equal to $p_D$, the true distribution/mass function of $y$ given $x$. Then 
with $D$-probability 1, the Bayesian posterior concentrates on $c^*$: 
$\lim_{m \to \infty} P(c^* \mid S^m) = 1$. 
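
The well-specified case can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the theorem: a finite set of hypothetical classifiers $c_j(x) = x_j$, a uniform prior, and a noise rate fixed at its true value rather than given its own prior.

```python
import math
import random

def posterior_over_classifiers(xs, ys, theta, prior):
    """Posterior over classifiers c_j(x) = x[j], with the noise rate theta fixed."""
    log_post = []
    for j, pj in enumerate(prior):
        errors = sum(1 for x, y in zip(xs, ys) if x[j] != y)
        log_lik = errors * math.log(theta) + (len(ys) - errors) * math.log(1 - theta)
        log_post.append(math.log(pj) + log_lik)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(1)
k, m, theta_star = 5, 500, 0.2  # hypothetical sizes and true noise rate
xs = [[rng.randint(0, 1) for _ in range(k)] for _ in range(m)]
# Well-specified truth: y = c_0(x) xor z, with z ~ Bernoulli(theta_star) independent of x.
ys = [x[0] ^ (1 if rng.random() < theta_star else 0) for x in xs]

prior = [1.0 / k] * k
post = posterior_over_classifiers(xs, ys, theta_star, prior)
print(post[0])  # posterior mass on the true classifier c_0
```

Here the noise bit really is independent of $x$, so the model contains the truth and the posterior mass on $c_0$ approaches 1 as $m$ grows.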

Consider now the learning problem underlying Theorem 2 as described in 
Section 4.1. Since $c_0$ achieves $\min_{c \in \mathcal{C}} e_D(c)$, it follows by part 1 of Proposition 1 
that $\min_{c,\theta} \Delta(p_{c,\theta}) = \Delta(p_{c_0,e_D(c_0)})$. If $\Delta(p_{c_0,e_D(c_0)})$ were 0, then by part 2 of 
Proposition 1, Blackwell-Dubins would apply, and we would have $P(c_0 \mid S^m) \to 1$. 
Theorem 2 states that this does not happen. It follows that the premise 
$\Delta(p_{c_0,e_D(c_0)}) = 0$ must be false. But since $\Delta(p_{c,\theta})$ is minimized for $(c_0, e_D(c_0))$, 
the Proposition implies that for no $c \in \mathcal{C}$ and no $\theta \in [0,1]$, $p_{c,\theta}$ is equal to $p_D(\cdot \mid \cdot)$ 
- in statistical terms, the model $\mathcal{P} = \{p_{c,\theta} : c \in \mathcal{C}, \theta \in [0,1]\}$ is misspecified. Thus, 
our result can be interpreted in two ways: 



1. 'ordinary' Bayesian inference can be inconsistent under misspecification: We 
exhibit a simple logistic regression model $\mathcal{P}$ and a true distribution $D$ such 
that, with probability 1, the Bayesian posterior does not converge to the 
distribution $p_{c_0,e_D(c_0)} \in \mathcal{P}$ that minimizes, among all $p \in \mathcal{P}$, the KL-divergence 
to $D$, even though $p_{c_0,e_D(c_0)}$ has substantial prior mass and is partially 
correct in the sense that $c_0$, the Bayes optimal classifier relative to $p_{c_0,e_D(c_0)}$, 
has true error rate $e_D(c_0)$, which is the same true error rate that it would 
have if $p_{c_0,e_D(c_0)}$ were 'true'. 

2. 'pragmatic' Bayesian inference for classification can be suboptimal: a stan- 
dard way to turn classifiers into distributions so as to make application of 
Bayesian inference possible may give rise to suboptimal performance. 

Two types of misspecification. $p_{c_0,e_D(c_0)}$ can be misspecified in two different 
ways. $p_{c_0,e_D(c_0)}$ expresses that $y = c_0(x)$ xor $z$ where $z$ is a noise bit generated 
independently of $x$. This statement may be wrong either because (a) $c_0$ is not 
the Bayes optimal classifier according to $D$; or (b) $c_0$ is Bayes optimal, but 
$z$ is dependent on $x$ under $D$. The way we defined our learning problem $D$ 
(Section 4.1) is an example of case (a). But we could have equally defined $c_0$ as 
follows: we replace step 3 of the generation of input values $x_j$ by the following 
procedure: for an easy example, we set $x_0 = y_0$. For a hard example, we set $x_0 = 
|1 - y_0|$ with probability pL/2pi' . Then, we can take $\mu_{\text{hard}} = 1/2$ and the proof of 
Theorem 2 holds unchanged. But now $c_0$ is the Bayes optimal classifier relative to 
$D$, as is easy to see. Thus, Bayesian inference can be inconsistent for classification 
in both case (a) (no Bayes act in $\mathcal{C}$) and case (b) (heteroskedasticity). 
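
Case (b) can be made concrete with a toy calculation. This is a hedged sketch and not the construction of Section 4.1: it assumes that whether an example is easy or hard is determined by $x$, and the probabilities `mu_hard`, `flip_hard` and `flip_easy` are hypothetical. It computes the expected KL-divergence from a heteroskedastic truth to the best single-noise-rate distribution built on $c_0$.

```python
import math

def kl_bernoulli(p, q):
    """KL(Bern(p) || Bern(q)) in nats, using the convention 0 ln 0 = 0."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total

mu_hard = 0.5    # hypothetical probability that an example is hard
flip_hard = 0.3  # hypothetical rate at which y differs from c_0(x) on hard examples
flip_easy = 0.0  # on easy examples c_0 is always right

# Overall error rate of c_0; a model with one noise rate must use this value.
theta = mu_hard * flip_hard + (1 - mu_hard) * flip_easy

# Expected conditional KL-divergence from the truth to p_{c_0, theta}.
delta = (mu_hard * kl_bernoulli(flip_hard, theta)
         + (1 - mu_hard) * kl_bernoulli(flip_easy, theta))
print(theta, delta)
```

Both flip rates are below 1/2, so $c_0$ is Bayes optimal in this toy setting, yet `delta` is strictly positive: no single noise rate reproduces a noise level that varies with $x$.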

Why is the result interesting for a Bayesian? Here we answer several 
objections that a Bayesian might have to our work. 

Bayesian inference has never been designed to work under misspecification. So 
why is the result relevant? 

We would maintain that in practice, Bayesian inference is applied all the time 
under misspecification in classification problems [12]. It is very hard to avoid 
misspecification with Bayesian classification, since the modeler often has no idea 
about the noise-generating process. Even though it may be known that noise is 
not homoskedastic, it may be practically impossible to incorporate all ways in 
which the noise may depend on x into the prior. 

It is already well-known that Bayesian inference can be inconsistent even if $\mathcal{P}$ is 
well-specified, i.e. if it contains $D$ [10]. So why is our result interesting? 
The (in)famous inconsistency results by Diaconis and Freedman [10] are based 
on nonparametric inference with uncountable sets $\mathcal{P}$. Their theorems require 
that the true $p$ has small prior density, and in fact prior mass 0 (see also [1]). 
In contrast, Theorem 2 still holds if we assign $p_{c_0,e_D(c_0)}$ arbitrarily large prior 
mass < 1, which, by the Blackwell-Dubins theorem, guarantees consistency if 
$\mathcal{P}$ is well-specified. We show that consistency may still fail dramatically if $\mathcal{P}$ is 
misspecified. This is interesting because even under misspecification, Bayes is 
consistent under fairly broad conditions [8, 16], in the sense that the posterior 
concentrates on a neighborhood of the distribution that minimizes KL-divergence 
to the true $D$. Thus, we feel our result is relevant at least from the inconsistency 
under misspecification interpretation. 

So how can our result co-exist with theorems establishing Bayesian consistency 
under misspecification? 

Such results are typically proved under either one of the following two assump- 
tions: 

1. The set of distributions $\mathcal{P}$ is 'simple', for example, finite-dimensional 
parametric. In such cases, ML estimation is usually also consistent - thus, for 
large $m$ the role of the prior becomes negligible. In case $\mathcal{P}$ corresponds to a 
classification model $\mathcal{C}$, this would obtain, for example, if $\mathcal{C}$ were finite or had 
finite VC-dimension. 

2. $\mathcal{P}$ may be arbitrarily large or complex, but it is convex: any finite mixture 
of elements of $\mathcal{P}$ is an element of $\mathcal{P}$. An example is the family of Gaussian 
mixtures with an arbitrary but finite number of components [17]. 

It is clear that our setup violates both conditions: $\mathcal{C}$ has infinite VC-dimension, 
and the corresponding $\mathcal{P}$ is not closed under taking mixtures. This suggests that 
we could make Bayes consistent again if, instead of $\mathcal{P}$, we would base inferences 
on its convex closure $\overline{\mathcal{P}}$. Computational difficulties aside, this approach will not 
work, since the crucial part (1) of Proposition 1, which we now use, will not hold any 
more: the conditional distribution in $\overline{\mathcal{P}}$ closest in KL-divergence to the true 
$p_D(y \mid x)$, when used for classification, may end up having larger generalization 
error (expected 0/1-loss) than the optimal classifier $c^*$ in the set $\mathcal{C}$ on which 
$\mathcal{P}$ was based. We will give an explicit example of this in the journal version of 
this paper. Thus, with a prior on $\overline{\mathcal{P}}$, the Bayesian posterior will converge, but 
potentially it converges to a distribution that is suboptimal in the performance 
measure we are interested in. 
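
That the model is not closed under mixtures can be seen on a two-bit toy example. This is a minimal sketch with hypothetical classifiers `c1` and `c2` and a hypothetical noise rate; it relies only on the observation that a single $p_{c,\theta}$ assigns to $y = 1$ one of at most two values, $\theta$ or $1-\theta$.

```python
from itertools import product

def p_one(c, theta, x):
    """p_{c,theta}(y = 1 | x) for a {0,1}-valued classifier c."""
    return 1.0 - theta if c(x) == 1 else theta

c1 = lambda x: x[0]  # hypothetical classifier: first input bit
c2 = lambda x: x[1]  # hypothetical classifier: second input bit
theta = 0.1

inputs = list(product((0, 1), repeat=2))
# Equal-weight mixture of the two conditional distributions.
mixture = {x: 0.5 * p_one(c1, theta, x) + 0.5 * p_one(c2, theta, x) for x in inputs}

# Collect the distinct values of the mixture's probability of y = 1.
distinct_values = sorted({round(v, 12) for v in mixture.values()})
print(distinct_values)
```

The mixture takes three distinct values across inputs (near 0.1, 0.5 and 0.9), so it cannot equal any single $p_{c,\theta'}$, whatever the classifier and noise rate.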

How 'standard' is the conversion from classifiers to probability distributions on 
which our results are based? 

One may argue that our notion of 'converting' classifiers into probability distri- 
butions is not always what Bayesians do in practice. For classifiers which produce 
real-valued output, such as neural networks and support vector machines, our 
transformation coincides with the logistic regression transformation, which is a 
standard Bayesian tool; see for example [14, 22]. But our theorems are based on 
classifiers with 0/1-output. With the exception of decision trees, such classifiers 
have not been addressed frequently in the Bayesian literature. Decision trees 
have usually been converted to conditional distributions differently, by assuming 
a different noise rate in each leaf of the decision tree [13]. This makes the set of 
all decision trees on a given input space X coincide with the set of all conditional 
distributions on X, and thus avoids the misspecification problem, at the cost of 
using a much larger model space. 



Thus, we have to concede that here is a weak point in our analysis: we use a 
transformation that has mostly been applied to real-valued classifiers, whereas 
our classifiers are 0/1-valued. Whether our inconsistency results can be extended 
in a natural way to classifiers with real-valued output remains to be seen. The 
fact that the Bayesian model corresponding to such neural networks will still 
typically be misspecified suggests (but does not prove) that similar scenarios 
may be constructed. 

7 Interpretation from an MDL Perspective 

From an MDL perspective, the relevance of our results needs much less discussion: 
the two-part code formula (5) has been used for classification by various 
authors; see, e.g., [21, 20] and [15]. [12] first noted that in this form, by using 
Stirling's approximation, (5) is essentially equivalent to MAP classification based on 
the models $p_{c,\theta}$ as defined in Section 2. Of course, there exist more refined 
versions of MDL based on one-part rather than two-part codes [2]. To apply these 
to classification, one somehow has to map classifiers to probability distributions 
explicitly. This was already anticipated by Meir and Merhav [19] who used the 
transformation described in this paper to define one-part codes. The resulting 
approach is closely related to the Bayesian posterior approach $c_{\mathrm{Bayes}(P,S)}$, 
suggesting that a version of our inconsistency Theorem 2 still applies. Rissanen [21] 
considered mapping classifiers $\mathcal{C}$ to distributions $\{p_{c,\theta^*}\}$ for a single value of $\theta^*$, 
e.g., $\theta^* = 1/3$. As discussed in Section 5.1, a version of Theorem 2 still applies 
to the resulting distributions. 
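
The role of Stirling's approximation can be illustrated numerically. Formula (5) is not reproduced in this section, so the sketch below assumes that its data term is $\ln \binom{m}{k}$ for a classifier making $k$ errors on $m$ examples; this form, and the values of `m` and `k`, are assumptions made for illustration. The code compares that term with the negative log-likelihood of $p_{c,\theta}$ minimized over $\theta$.

```python
import math

def log_binom(m, k):
    """ln C(m, k), computed via lgamma to avoid huge integers."""
    return math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)

def max_neg_log_lik(m, k):
    """Minimum over theta of -ln(theta^k (1 - theta)^(m - k)), attained at theta = k/m."""
    if k == 0 or k == m:
        return 0.0
    t = k / m
    return -(k * math.log(t) + (m - k) * math.log(1 - t))

m, k = 10_000, 1_500  # hypothetical sample size and number of training errors
gap = max_neg_log_lik(m, k) - log_binom(m, k)
print(gap, math.log(m + 1))
```

The gap is positive but at most $\ln(m+1)$, which is negligible next to the two terms themselves (both of order $m$), consistent with the two criteria being essentially equivalent.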

How to code hypotheses - choice of codes and priors. It may seem that our 
results are in line with the investigation of Kearns et al. [15]. This, however, 
is not clear - Kearns et al. consider a scenario in which two-part code MDL 
for classification shows quite bad experimental performance, and MDL must be 
'consistent'. Indeed, Kearns et al. observe that for However, according to [23], 
this is caused by the coding method used to encode hypotheses. This method 
does not take into account the precision of parameters involved. In the paper 
[23], a slightly different coding scheme is proposed. With this coding scheme, 
MDL apparently behaves quite well on the classification problem studied by 
Kearns et al. 

One may transplant the arguments of Viswanathan et al. [23] to our setup: 
we can only prove inconsistency for specific choices of the prior, corresponding to 
particular ways of coding hypotheses. In practice, one usually employs hypothe- 
ses of a different nature than we do, and one can use properties of hypotheses 
such as the precision with which they are specified to come up with 'reasonable' 
priors/codes, which possibly do not suffer any inconsistency problems. However, 
the intriguing fact remains that if a probabilistic model $\mathcal{P}$ is well-specified, then 
under very broad conditions MDL is consistent [4] - under almost no conditions 
on the prior. Our work shows that if a set of classifiers $\mathcal{C}$ is used (corresponding 
to a misspecified probability model $\mathcal{P}$), then the choice of prior becomes of 
crucial importance, even with an infinite amount of data. 



Related Work. Yamanishi and Barron [24, 3] proposed modifications of the two-part 
MDL coding scheme so that it would be applicable for inference with respect 
to general classes of predictors and loss functions, including classification with 
0/1-loss as a special case. Both Yamanishi and Barron prove the consistency 
(and give rates of convergence) for their procedures. Similarly, McAllester's 
PAC-Bayesian method [18] can be viewed as a modification of Bayesian inference that 
is provably consistent for classification, based on sophisticated extensions of the 
Occam's Razor bound, Theorem 4. These modifications anticipate our result, 
since it must have been clear to the authors that without the modification, MDL 
(and discrete Bayesian MAP) are not consistent for classification. Nevertheless, 
we seem to be the first to have explicitly formalized and proved this. 